Lc0 crash

zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Lc0 crash

Post by zullil »

Code:

0809 10:40:47.826601 140535458146048 ../../src/chess/uciloop.cc:218] << info string e5f6  (785 ) N: 76779030 (+155) (P: 13.18%) (Q:  0.03640) (D:  0.119) (U: 0.00031) (Q+U:  0.03671) (V:  0.0276) 
0809 10:40:49.980398 140535441360640 /home/louis/Documents/Chess/lc0/src/utils/exception.h:39] Exception: CUDA error: an illegal memory access was encountered (../../src/neural/cuda/network_cudnn.cc:601) 

Code:

Aug  9 10:40:49  kernel: [16848.854627] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 1): Out Of Range Address
Aug  9 10:40:49  kernel: [16848.854637] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Global Exception on (GPC 2, TPC 3, SM 1): Multiple Warp Errors
Aug  9 10:40:49  kernel: [16848.854641] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x515fb0=0xc12000e 0x515fb4=0x24 0x515fa8=0x4c1eb72 0x515fac=0x174
Aug  9 10:40:49  kernel: [16848.855375] NVRM: Xid (PCI:0000:03:00): 43, Ch 00000030

Code:

0809 10:07:53.252346 140536039936448 ../../src/neural/factory.cc:115] Loading weights file from: /home/louis/Documents/Chess/lc0/build/release/network42850
0809 10:07:53.822033 140536039936448 ../../src/neural/factory.cc:84] Creating backend [cudnn-fp16]...
0809 10:07:53.922344 140536039936448 ../../src/neural/cuda/network_cudnn.cc:699] GPU: GeForce RTX 2080 Ti
0809 10:07:53.922407 140536039936448 ../../src/neural/cuda/network_cudnn.cc:700] GPU memory: 10.7534 Gb
0809 10:07:53.922436 140536039936448 ../../src/neural/cuda/network_cudnn.cc:702] GPU clock frequency: 1635 MHz
0809 10:07:53.922451 140536039936448 ../../src/neural/cuda/network_cudnn.cc:703] GPU compute capability: 7.5
0809 10:07:53.922467 140536039936448 ../../src/neural/cuda/network_cudnn.cc:710] CUDA Runtime version: 10.1.0
0809 10:07:53.922478 140536039936448 ../../src/neural/cuda/network_cudnn.cc:723] Cudnn version: 7.6.2
0809 10:07:53.922488 140536039936448 ../../src/neural/cuda/network_cudnn.cc:737] Latest version of CUDA supported by the driver: 10.1.0
0809 10:08:02.140397 140536039936448 ../../src/chess/uciloop.cc:131] >> ucinewgame

I'm using NNCacheSize = 100000000, and I have 128 GB of RAM. According to Lc0's own log file, this should have been more than enough for the search I was running:

Code:

0809 10:08:13.744528 140536039936448 ../../src/engine.cc:177] RAM limit 100000MB. Cache takes 31200MB. Remaining memory is enough for 344000000 nodes.
0809 10:08:13.744541 140536039936448 ../../src/engine.cc:372] Limits: visits:100000000 playouts:-1 depth:-1 infinite:0

Any ideas about what might be going on here would be much appreciated. I'm just starting with Lc0, which is quite fascinating, when it works.
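
The figures in the engine.cc:177 log line can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes "MB" means 10^6 bytes, and the function name `memory_budget` is made up for this example rather than taken from Lc0.

```python
def memory_budget(ram_limit_mb, cache_mb, cache_entries, node_capacity):
    """Derive per-item sizes implied by the figures Lc0 printed.

    Assumes 1 MB = 1_000_000 bytes (an assumption, not confirmed by the log).
    """
    remaining_mb = ram_limit_mb - cache_mb
    bytes_per_cache_entry = cache_mb * 1_000_000 / cache_entries
    bytes_per_node = remaining_mb * 1_000_000 / node_capacity
    return remaining_mb, bytes_per_cache_entry, bytes_per_node


# Values copied from the log: RAM limit, cache size, NNCacheSize, node capacity.
remaining_mb, per_entry, per_node = memory_budget(
    ram_limit_mb=100_000,
    cache_mb=31_200,
    cache_entries=100_000_000,
    node_capacity=344_000_000,
)

# The search was limited to 100,000,000 visits (engine.cc:372).
visit_limit = 100_000_000
headroom = 344_000_000 / visit_limit

print(f"Remaining after cache: {remaining_mb} MB")
print(f"Implied size per cache entry: {per_entry:.0f} bytes")
print(f"Implied size per node: {per_node:.0f} bytes")
print(f"Node capacity is {headroom:.2f}x the visit limit")
```

Under that assumption the log implies 312 bytes per cache entry and 200 bytes per node, with room for about 3.4 times as many nodes as the visit limit. Note that this budget concerns system RAM, while the exception itself is a CUDA error, so the calculation does not by itself explain the crash.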
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: Lc0 crash

Post by mwyoung »

The longest I have run a single position is 5 1/2 hours. It ran on a 2080 Ti with 64 GB of RAM without issues, and at some point it started using my NVMe drive as RAM, also without issues. :shock:

And I have run Lc0 in match play for over a week without issues.

Did you log your GPU temp?

The only issue I have had running Lc0 is when I overclock my GPU. Even if I overclock it by only 1%, Lc0 will crash at some point. If I do not overclock, Lc0 never crashes.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Lc0 crash

Post by zullil »

mwyoung wrote: Sat Aug 10, 2019 4:47 am The longest I have run a single position is 5 1/2 hours. It ran on a 2080 Ti with 64 GB of RAM without issues, and at some point it started using my NVMe drive as RAM, also without issues. :shock:

And I have run Lc0 in match play for over a week without issues.

Did you log your GPU temp?

The only issue I have had running Lc0 is when I overclock my GPU. Even if I overclock it by only 1%, Lc0 will crash at some point. If I do not overclock, Lc0 never crashes.

I didn't log the GPU temp. I've been checking it sporadically during searches, and I've never seen it above 73 °C. But I have been using Nvidia's X Server Settings app to give a small positive offset to the graphics clock. Perhaps this is the culprit. I'll return to the default setting and see if the crashes stop. Thanks.
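
For anyone who wants a continuous record instead of spot checks, here is a minimal sketch of a temperature and clock logger. It assumes `nvidia-smi` is on the PATH and that the GPU reports the `temperature.gpu` and `clocks.sm` query fields as plain integers; the function names and the log file name are made up for this example.

```python
import subprocess
import time

# Fields requested from nvidia-smi, returned as one CSV line per GPU.
QUERY = "timestamp,temperature.gpu,clocks.sm"


def parse_sample(line):
    """Split one CSV line into (timestamp, temperature in C, SM clock in MHz)."""
    timestamp, temp, clock = [field.strip() for field in line.split(",")]
    return timestamp, int(temp), int(clock)


def read_sample():
    """Query nvidia-smi once and return the reading for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def log_gpu(path, interval_s=5.0, samples=None):
    """Append readings to `path`; runs until interrupted if `samples` is None."""
    taken = 0
    with open(path, "a") as log:
        while samples is None or taken < samples:
            timestamp, temp_c, clock_mhz = read_sample()
            log.write(f"{timestamp} temp={temp_c}C sm_clock={clock_mhz}MHz\n")
            log.flush()  # keep the last readings on disk if the machine hangs
            taken += 1
            time.sleep(interval_s)


# Example call (not executed here): log_gpu("gpu_log.txt", interval_s=5.0)
```

Run alongside a long search, the last few lines of the file show the temperature and clock just before a crash, which makes it easier to compare runs with and without the clock offset.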
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Lc0 crash

Post by smatovic »

zullil wrote: Fri Aug 09, 2019 5:09 pm

Code:

0809 10:40:47.826601 140535458146048 ../../src/chess/uciloop.cc:218] << info string e5f6  (785 ) N: 76779030 (+155) (P: 13.18%) (Q:  0.03640) (D:  0.119) (U: 0.00031) (Q+U:  0.03671) (V:  0.0276) 
0809 10:40:49.980398 140535441360640 /home/louis/Documents/Chess/lc0/src/utils/exception.h:39] Exception: CUDA error: an illegal memory access was encountered (../../src/neural/cuda/network_cudnn.cc:601) 
...
I am not familiar with the CUDA or Lc0 code, but this looks like a software bug rather than a hardware issue.

--
Srdja
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Lc0 crash

Post by zullil »

smatovic wrote: Sat Aug 10, 2019 12:03 pm
zullil wrote: Fri Aug 09, 2019 5:09 pm

Code:

0809 10:40:47.826601 140535458146048 ../../src/chess/uciloop.cc:218] << info string e5f6  (785 ) N: 76779030 (+155) (P: 13.18%) (Q:  0.03640) (D:  0.119) (U: 0.00031) (Q+U:  0.03671) (V:  0.0276) 
0809 10:40:49.980398 140535441360640 /home/louis/Documents/Chess/lc0/src/utils/exception.h:39] Exception: CUDA error: an illegal memory access was encountered (../../src/neural/cuda/network_cudnn.cc:601) 
...
I am not familiar with the CUDA or Lc0 code, but this looks like a software bug rather than a hardware issue.

--
Srdja

Thanks. I agree. A quick look at the code suggests an invalid memory access while data is being copied from the GPU to system RAM.
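
When weighing software against hardware causes, it can help to group the NVRM lines from the first post by Xid code. The sketch below is an illustration only: the short descriptions in `XID_NAMES` are my reading of NVIDIA's Xid documentation and should be checked against the official table, and the regular expression assumes the line layout seen in this particular kernel log.

```python
import re
from collections import Counter

# Assumed descriptions; verify against NVIDIA's Xid error documentation.
XID_NAMES = {13: "Graphics Engine Exception", 43: "GPU stopped processing"}

# Matches e.g. "NVRM: Xid (PCI:0000:03:00): 13, <detail>"
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def parse_xid(line):
    """Return (pci_address, xid_code, detail) or None if the line has no Xid."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return match["pci"], int(match["xid"]), match["detail"].strip()


# Kernel log lines copied from the first post.
KERNEL_LOG = """\
Aug  9 10:40:49  kernel: [16848.854627] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 1): Out Of Range Address
Aug  9 10:40:49  kernel: [16848.854637] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Global Exception on (GPC 2, TPC 3, SM 1): Multiple Warp Errors
Aug  9 10:40:49  kernel: [16848.854641] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x515fb0=0xc12000e 0x515fb4=0x24 0x515fa8=0x4c1eb72 0x515fac=0x174
Aug  9 10:40:49  kernel: [16848.855375] NVRM: Xid (PCI:0000:03:00): 43, Ch 00000030
"""

events = [e for e in map(parse_xid, KERNEL_LOG.splitlines()) if e is not None]
counts = Counter(xid for _, xid, _ in events)

for xid, n in sorted(counts.items()):
    print(f"Xid {xid} ({XID_NAMES.get(xid, 'unknown')}): {n} event(s)")
```

As far as I understand NVIDIA's documentation, Xid 13 can result from an application error as well as from a hardware or driver problem, so the kernel log alone may not settle the question either way.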