Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

OliverBr
Posts: 725
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by OliverBr »

Hello everyone,

I am trying to play a couple of hundred games on a 16-core Linux server with cutechess-cli.
It works flawlessly with any version of OliThink, but when Fruit or Gnuchess (5 or 6) is playing, the tournament ends abruptly after an unpredictable amount of time with the error "Terminating process of engine XboardEngine(22) - White's connection stalls".
It is always the XboardEngine (Fruit/Gnuchess) that gets terminated and whose connection stalls.

Do you have any idea what's happening? I guess it's a bug in the XboardEngine that only shows up on a system like this. I compiled the engines on the target system.

Here is the output.

Code:

Finished game 86 (XboardEngine vs OliThink 5.5.8): 0-1 {Black mates}
Score of OliThink 5.5.8 vs XboardEngine: 76 - 5 - 6  [0.908] 87
Started game 103 of 1000 (OliThink 5.5.8 vs XboardEngine)
Finished game 80 (XboardEngine vs OliThink 5.5.8): 1/2-1/2 {Draw by insufficient mating material}
Score of OliThink 5.5.8 vs XboardEngine: 76 - 5 - 7  [0.903] 88
Started game 104 of 1000 (XboardEngine vs OliThink 5.5.8)
Terminating process of engine XboardEngine(22)
Finished game 90 (XboardEngine vs OliThink 5.5.8): 0-1 {White's connection stalls}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 104 (XboardEngine vs OliThink 5.5.8): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 92 (XboardEngine vs OliThink 5.5.8): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 100 (XboardEngine vs OliThink 5.5.8): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 98 (XboardEngine vs OliThink 5.5.8): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 91 (OliThink 5.5.8 vs XboardEngine): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 89 (OliThink 5.5.8 vs XboardEngine): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 95 (OliThink 5.5.8 vs XboardEngine): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 103 (OliThink 5.5.8 vs XboardEngine): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 99 (OliThink 5.5.8 vs XboardEngine): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 96 (XboardEngine vs OliThink 5.5.8): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 93 (OliThink 5.5.8 vs XboardEngine): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 102 (XboardEngine vs OliThink 5.5.8): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 94 (XboardEngine vs OliThink 5.5.8): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 101 (OliThink 5.5.8 vs XboardEngine): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Finished game 97 (OliThink 5.5.8 vs XboardEngine): * {No result}
Score of OliThink 5.5.8 vs XboardEngine: 77 - 5 - 7  [0.904] 89
Elo difference: 390.55 +/- 120.52
Finished match
And here is the configuration:

Code:

#uname -a
Linux rescue 5.4.47 #1 SMP Thu Jun 18 07:22:31 UTC 2020 x86_64 GNU/Linux

#lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               8
Model name:          AMD Ryzen Threadripper 2950X 16-Core Processor
Stepping:            2
CPU MHz:             1888.997
CPU max MHz:         3500.0000
CPU min MHz:         2200.0000
BogoMIPS:            6999.16
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
Chess Engine OliThink: http://brausch.org/home/chess
OliThink GitHub: https://github.com/olithink
jswaff
Posts: 105
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

Re: Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by jswaff »

My guess is the engine has crashed. (I've seen that error before and that was the case.) Try opening a window with the list of processes and see if one falls off when you see that message.

Running cutechess-cli with the -debug option may give some more info.
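
To check the crash theory without watching the process list by hand, the saved match output can be scanned for abnormal endings. Here is a minimal Python sketch, assuming the console output was saved as text and that the "Finished game" lines look like the ones posted above. The function name and the list of "suspicious" reasons are my own choices, not something cutechess-cli defines.

```python
import re

# Matches lines such as:
#   Finished game 90 (XboardEngine vs OliThink 5.5.8): 0-1 {White's connection stalls}
FINISHED = re.compile(
    r"^Finished game (\d+) \((.+?) vs (.+?)\): (\S+) \{(.+)\}$"
)

# Assumption: these substrings mark an ending worth investigating.
SUSPICIOUS = ("connection stalls", "disconnects", "No result")


def abnormal_games(log_text):
    """Return (game_number, white, black, result, reason) for abnormal endings."""
    hits = []
    for line in log_text.splitlines():
        m = FINISHED.match(line.strip())
        if m is None:
            continue  # score lines, "Started game" lines, etc.
        number, white, black, result, reason = m.groups()
        if any(s in reason for s in SUSPICIOUS):
            hits.append((int(number), white, black, result, reason))
    return hits


sample = """\
Finished game 86 (XboardEngine vs OliThink 5.5.8): 0-1 {Black mates}
Terminating process of engine XboardEngine(22)
Finished game 90 (XboardEngine vs OliThink 5.5.8): 0-1 {White's connection stalls}
Finished game 104 (XboardEngine vs OliThink 5.5.8): * {No result}
"""

for game in abnormal_games(sample):
    print(game)
```

The first stalled game is usually the interesting one; the "No result" entries after it are just the other concurrent games being aborted when the match stops.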
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by chrisw »

OliverBr wrote: Tue Jul 21, 2020 6:54 pm It works flawlessly with any version of OliThink, but when Fruit or Gnuchess (5 or 6) are playing the tournament, it will abruptly end after an unknown time with an error "Terminating process of engine XboardEngine(22) - White's connection stalls".
I gave up fighting it; it was all too random. Just reduce the number of cores you use by one until it is reliable again.
Some of my systems are okay with running N games in parallel on N cores and one of them is only good at N-1 games.
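
As a rough illustration of backing the concurrency off, here is a Python sketch that reads the physical core count from lscpu output like the one posted in this thread and builds a cutechess-cli command line with one core left idle. The engine names, paths, protocols, time control, game count and the `reserve` parameter are placeholders I made up. Only `-engine`, `-each`, `-games` and `-concurrency` are actual cutechess-cli options.

```python
import re


def physical_cores(lscpu_text):
    """Physical cores = sockets * cores per socket, read from lscpu output."""
    def field(name):
        pattern = r"^" + re.escape(name) + r":\s*(\d+)\s*$"
        m = re.search(pattern, lscpu_text, re.MULTILINE)
        if m is None:
            raise ValueError("field not found in lscpu output: " + name)
        return int(m.group(1))

    return field("Socket(s)") * field("Core(s) per socket")


def match_command(cores, reserve=1, games=1000):
    """Build a cutechess-cli argv that leaves `reserve` cores idle."""
    concurrency = max(1, cores - reserve)
    return [
        "cutechess-cli",
        # Hypothetical engine entries; set cmd= and proto= to match the
        # binaries and protocols actually in use.
        "-engine", "name=EngineA", "cmd=./engine_a", "proto=xboard",
        "-engine", "name=EngineB", "cmd=./engine_b", "proto=uci",
        "-each", "tc=40/60",  # placeholder time control
        "-games", str(games),
        "-concurrency", str(concurrency),
    ]


lscpu_sample = """\
CPU(s):              32
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
"""

cores = physical_cores(lscpu_sample)
print(cores, " ".join(match_command(cores)))
```

Note that the machine above reports 32 CPUs but only 16 physical cores, so counting logical CPUs would already oversubscribe it. If N-1 still stalls, raise `reserve` again.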
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by jdart »

I run cutechess with xboard engines all the time. "Connection stalls" or "disconnects" messages are almost always engine bugs. Note too that some engines cannot handle very fast time controls.
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by chrisw »

jdart wrote: Tue Jul 21, 2020 8:52 pm I run cutechess with xboard engines all the time. "Connection stalls" or "disconnects" messages are almost always engine bugs. Note too some engines cannot handle very fast time controls.
Almost always is correct, but cutechess makes its own decisions about disconnection failures, and I convinced myself that running on all cores is too much for *some* systems.
OliverBr
Posts: 725
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Re: Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by OliverBr »

I have some more information:
I accidentally ran Fruit 2.1 with the XBoard protocol, which may have caused the problems. With UCI it works fine.

As GNU Chess and Fruit are more or less siblings, I guess their XBoard protocol handling is the same, and so is the bug.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by hgm »

Fruit 2.1 does not support XBoard protocol. AFAIK GNU Chess 6 is a version of Fruit that combines a version of Polyglot hacked to use the GNU dialect of XBoard protocol and the Fruit engine in one executable. It might be multi-threaded, using separate threads to run the adapter and the engine.
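
One way to check what a binary actually answers before choosing proto= in cutechess-cli is to send it the UCI handshake and look for "uciok". Here is a minimal Python sketch. The helper name `speaks_uci` and the timeout are my own choices, and the stand-in "engine" is just a tiny Python script so that the example runs without a real engine installed.

```python
import subprocess
import sys


def speaks_uci(argv, timeout=5.0):
    """Send 'uci' then 'quit' and report whether the engine replied 'uciok'."""
    try:
        done = subprocess.run(
            argv,
            input="uci\nquit\n",
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The engine neither answered nor exited: treat it as not speaking UCI.
        return False
    return any(line.strip() == "uciok" for line in done.stdout.splitlines())


# Stand-in engine for demonstration only (not a real chess engine).
FAKE_UCI_ENGINE = [
    sys.executable, "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    cmd = line.strip()\n"
    "    if cmd == 'uci':\n"
    "        print('id name Fake')\n"
    "        print('uciok')\n"
    "    elif cmd == 'quit':\n"
    "        break\n",
]

print(speaks_uci(FAKE_UCI_ENGINE))
```

With a real engine the call would be something like `speaks_uci(["./fruit"])`, where the path is hypothetical. A negative answer only says the binary did not complete the UCI handshake; it does not prove that its XBoard support is sound.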
OliverBr
Posts: 725
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Re: Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by OliverBr »

hgm wrote: Wed Jul 22, 2020 6:42 am Fruit 2.1 does not support XBoard protocol. AFAIK GNU Chess 6 is a version of Fruit that combines a version of Polyglot hacked to use the GNU dialect of XBoard protocol and the Fruit engine in one executable. It might be multi-threaded, using separate threads to run the adapter and the engine.
Thank you. UCI Fruit works well and reliably. It's possible to run 32 concurrent games on an AMD EPYC 7502P 32-Core Processor without any issues.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Cutechessing with Fruit/Gnuchess on a 16-core errors with "connection stalls"

Post by lucasart »

OliverBr wrote: Wed Jul 22, 2020 8:17 pm UCI Fruit works well and reliably. It's possible to run 32 concurrent games on an AMD EPYC 7502P 32-Core Processor without any issues.
Actually, I found that Fruit loses on time quite a lot, especially when you remove adjudication.

If you're using UCI, you can try c-chess-cli. Trivial to compile and use (see readme). It has better multi-threaded logging. All I/O is logged per thread, and timeout information enforced by master threads is also logged:
https://github.com/lucasart/c-chess-cli
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.