CUDA errors after driver update


ikenned...@hotmail.com

Jan 1, 2019, 2:30:45 PM
to LCZero
LC was working fine on a GTX 1050 laptop up to v19. I recently downloaded v20, which enforced a driver update; the latest was 417.35. Since then CUDA goes down after Leela starts playing. The time to failure (games, moves or thinking time) varies, but from observing with Afterburner it appears to go down a few seconds after Leela finishes a calculation, never during one. The best run I've seen was 30 games in a gauntlet and the worst was one move in game 1.

After that it simply reports
error CUDA error: no CUDA-capable device is detected

and a reboot is required. Other apps like CUDA-Z give a similar message.

I tried rolling back the driver update and then installing 411.63, which was about the oldest that satisfied the check in Leela, but it has made no difference.

I have just tried 20.1 OpenCL and this runs fine - I get about 400-500 nps versus about 2 knps with the CUDA version.

I can reproduce the problem from the command line just using
go nodes 10000

It does one calculation, then it 'goes to sleep' and I have to kill the app; on restart I get the same CUDA failure. I also tried running under the Shredder GUI instead of my normal Arena, same problem.
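
For anyone wanting to script this reproduction rather than type it by hand, here is a minimal sketch in Python. It assumes the engine speaks plain UCI on stdin/stdout and searches from the start position; the function names and the `line_timeout` parameter are made up for illustration and are not part of lc0.

```python
import queue
import subprocess
import threading


def uci_commands(nodes):
    """UCI commands for one fixed-node search from the start position."""
    return ["uci", "isready", "position startpos", f"go nodes {nodes}"]


def find_bestmove(lines):
    """Return the first 'bestmove ...' line, or None if the search never finished."""
    for line in lines:
        if line.startswith("bestmove"):
            return line.strip()
    return None


def run_search(engine_path, nodes=10000, line_timeout=60.0):
    """Run one search; return (bestmove_or_None, output_lines).

    line_timeout is the longest silence tolerated between output lines,
    not a limit on the whole search (an assumption of this sketch).
    """
    proc = subprocess.Popen(
        [engine_path],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep error messages with the normal output
        text=True,
    )
    lines = queue.Queue()

    def pump():
        # Reader thread: forward engine output so the main thread can time out.
        for line in proc.stdout:
            lines.put(line)

    threading.Thread(target=pump, daemon=True).start()
    seen = []
    try:
        proc.stdin.write("\n".join(uci_commands(nodes)) + "\n")
        proc.stdin.flush()
        while find_bestmove(seen[-1:]) is None:
            seen.append(lines.get(timeout=line_timeout).rstrip())
    except (queue.Empty, OSError):
        pass  # engine went silent or its pipe closed
    finally:
        proc.kill()  # same as killing the hung app by hand
        proc.wait()
    return find_bestmove(seen), seen
```

Calling `run_search` with the path to the lc0 executable returns the `bestmove` line when the search completes, or `None` together with whatever output was captured (including any CUDA error text) when the engine hangs or exits early.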

I am using all default settings, so no command line options or added UCI options.

Networks are currently a mix of 11198 and 32300.

I can't realistically go right back to the original driver, otherwise I'd be stuck at v19. The last and only time I've done a driver update was Sep 6th, when I installed 397.93.

ikenned...@hotmail.com

Jan 3, 2019, 2:46:20 AM
to LCZero
I don't have a solution to this yet, but I found the nvidia-smi diagnostic utility. This is its output immediately after completing the 'go nodes 100' command, and then a second time a few moments later:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 417.35       Driver Version: 417.35       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   27C    P0    N/A /  N/A |    886MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3916      C   ...hess\_Engines\lcz\v20RC1_N32300\lc0.exe N/A      |
+-----------------------------------------------------------------------------+

< a moment later ...>
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost.  Reboot the system to recover this GPU
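
To pin down how long after the search the GPU drops off, the check above could be repeated in a loop. A minimal Python sketch, assuming `nvidia-smi` is on the PATH or its full path is passed in; the function names and the "ok"/"lost"/"unknown" labels are invented for illustration, and the string matches are taken only from the two outputs shown above.

```python
import subprocess
import time


def classify(smi_output):
    """Label nvidia-smi output as 'lost', 'ok' or 'unknown'."""
    text = smi_output.lower()
    if "gpu is lost" in text:
        return "lost"
    if "driver version" in text:
        # The normal status table prints a 'Driver Version:' header.
        return "ok"
    return "unknown"


def watch(smi_path="nvidia-smi", interval=1.0, max_polls=60):
    """Poll nvidia-smi until the GPU is reported lost.

    Returns a list of (timestamp, state) pairs, one per poll.
    """
    history = []
    for _ in range(max_polls):
        result = subprocess.run([smi_path], capture_output=True, text=True)
        state = classify(result.stdout + result.stderr)
        history.append((time.strftime("%H:%M:%S"), state))
        if state == "lost":
            break
        time.sleep(interval)
    return history
```

Starting `watch()` in a second console just before sending the `go` command would give a timestamped record of the moment the state flips from "ok" to "lost".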

This was a new laptop bought in September as a dedicated Leela test machine, so right now it's money down the drain.

gvergh...@gmail.com

Jan 3, 2019, 5:10:03 PM
to LCZero
Search for DDU.
The fix has been covered already.

ikenned...@hotmail.com

Jan 5, 2019, 3:14:33 PM
to LCZero
Where exactly has 'The fix ...been covered already'?

I found an Nvidia thread from someone with the exact same laptop as me. He was told to use DDU and then install the Lenovo OEM version of the driver, not the ones from the Nvidia website. Unfortunately the only version they have is way too old. I have been through the steps he listed and am no further forward, either straight away or after subsequently trying to re-upgrade to 417.35 via Nvidia direct. In fact Afterburner now says I don't have an Nvidia GPU at all, whilst nvidia-smi and CUDA-Z disagree.

https://forums.geforce.com/default/topic/1070376/geforce-drivers/nvidia-geforce-gtx-1050-in-a-lenovo-legion-y520-laptop/

This led me to

Nvidia_VGA_23.21.13.9125_NonHD
20 Nov 2018

and after installing that, Device Manager not surprisingly informed me my driver was now dated 16.3.2018.

ikenned...@hotmail.com

Jan 5, 2019, 3:29:38 PM
to LCZero
OK, sorry, I see your post in this forum about DDU, but please see the GeForce forum post I cited. I guess I could redo the DDU step and go straight to the Nvidia driver rather than the Lenovo one, but that greybear seems adamant I should stick to Lenovo.

gvergh...@gmail.com

Jan 7, 2019, 1:27:00 PM
to LCZero
Then stick with greybear :)